2/14/23
Q: Is there ever a time when we should use \(R^2\) instead of adjusted \(R^2\) when analyzing a model?
A: When talking about the variance explained by a single model, \(R^2\) is great, but when comparing across models you’ll always want to use adjusted \(R^2\), since it penalizes a model for including additional predictors.
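Both values are available from summary() on a fitted lm object. A quick sketch (toy mtcars example, not the case-study data):

```r
# R^2 vs adjusted R^2 for two nested models (toy mtcars example)
fit1 <- lm(mpg ~ wt, data = mtcars)
fit2 <- lm(mpg ~ wt + hp, data = mtcars)
summary(fit1)$r.squared      # variance explained by this one model
summary(fit1)$adj.r.squared  # penalized for the number of predictors
summary(fit2)$adj.r.squared  # comparable across models of different sizes
```

Adjusted \(R^2\) only goes up if the added predictor improves the fit by more than you’d expect by chance.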
Q: How do we use the data from OCS for our case study? Should we merge the data files?
A: Excellent question! That’s what today’s lecture is all about. There’s a whole lot of wrangling to do before we can use these data!
Q: Do we need to look at the p-value when we do analysis? (Midterm01)
A: When interpreting a model, no. When doing hypothesis testing (we’ll get there), it is one piece you can look at. A few people did interpret p-values on the midterm, and that’s ok! (But it was not required.)
Q: Can we have a system where we can find other students to group with? Like a google form?
A: Great question! I’ll start a pinned thread on Campuswire so you all can find one another.
Q: I think the data seems pretty confusing.
A: That’s b/c it is! We’ve got a lot of work to do to get it into a usable/understandable format.
Q: In what contexts should we use an interaction model versus a main-effects model?
A: Interaction terms should be included when the relationship between one predictor and the outcome varies by another predictor.
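As a sketch (toy mtcars example, not our case-study data), the two model formulas look like:

```r
# main-effects model: the slope of wt is the same for both transmission types
main_fit <- lm(mpg ~ wt + factor(am), data = mtcars)
# interaction model: the slope of wt is allowed to differ by transmission type
int_fit  <- lm(mpg ~ wt * factor(am), data = mtcars)
```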
Due Dates:
Notes:
Right to Carry (RTC) Laws - “a law that specifies if and how citizens are allowed to have a firearm on their person or nearby (for example, in a citizen’s car) in public.”
Two contradictory analyses:
John J. Donohue et al., Right‐to‐Carry Laws and Violent Crime: A Comprehensive Assessment Using Panel Data and a State‐Level Synthetic Control Analysis. Journal of Empirical Legal Studies, 16,2 (2019).
David B. Mustard & John Lott. Crime, Deterrence, and Right-to-Carry Concealed Handguns. Coase-Sandor Institute for Law & Economics Working Paper No. 41, (1996).
There are a whole bunch of different data files we’ll be using…
👉 Your Turn: Load the data into RStudio. It will take a while…so just let it get started.
Get two datasets (Lott, Donohue) that contain demographic, population, police staffing, unemployment, violent crime, RTC, and poverty information at the state level across time.
❓ What would be the tidy way to store these data?
🧠 Take a look in one of the data folders, open at least one of the data files to view it, and try to get a sense of the type of information contained within it.
dem_77_79 <- read_csv("data/raw/Demographics/Decade_1970/pe-19.csv", skip = 5)
dem_80_89 <- list.files(recursive = TRUE,
path = "data/raw/Demographics/Decade_1980",
pattern = "*.csv",
full.names = TRUE) |>
purrr::map(~read_csv(., skip=5))
dem_90_99 <- list.files(recursive = TRUE,
path = "data/raw/Demographics/Decade_1990",
pattern = "*.txt",
full.names = TRUE) |>
map(~read_table2(., skip = 14))
dem_00_10 <- list.files(recursive = TRUE,
path = "data/raw/Demographics/Decade_2000",
pattern = "*.csv",
full.names = TRUE) |>
map(~read_csv(.))
Source: US Census Bureau Data
Rows: 62,244
Columns: 21
$ REGION <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ DIVISION <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ STATE <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0…
$ NAME <chr> "United States", "United States", "United States", "…
$ SEX <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ ORIGIN <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ RACE <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ AGEGRP <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15…
$ ESTIMATESBASE2000 <dbl> 281424600, 19176154, 20549855, 20528425, 20218782, 1…
$ POPESTIMATE2000 <dbl> 282162411, 19178293, 20463852, 20637696, 20294955, 1…
$ POPESTIMATE2001 <dbl> 284968955, 19298217, 20173362, 20978678, 20456284, 1…
$ POPESTIMATE2002 <dbl> 287625193, 19429192, 19872417, 21261421, 20610370, 2…
$ POPESTIMATE2003 <dbl> 290107933, 19592446, 19620851, 21415353, 20797166, 2…
$ POPESTIMATE2004 <dbl> 292805298, 19785885, 19454237, 21411680, 21102552, 2…
$ POPESTIMATE2005 <dbl> 295516599, 19917400, 19389067, 21212579, 21486214, 2…
$ POPESTIMATE2006 <dbl> 298379912, 19938883, 19544688, 21033138, 21807709, 2…
$ POPESTIMATE2007 <dbl> 301231207, 20125962, 19714611, 20841042, 22067816, 2…
$ POPESTIMATE2008 <dbl> 304093966, 20271127, 19929602, 20706655, 22210880, 2…
$ POPESTIMATE2009 <dbl> 306771529, 20244518, 20182499, 20660564, 22192810, 2…
$ CENSUS2010POP <dbl> 308745538, 20201362, 20348657, 20677194, 22040343, 2…
$ POPESTIMATE2010 <dbl> 309349689, 20200529, 20382409, 20694011, 21959087, 2…
Rows: 64
Columns: 4
$ Region <chr> "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1",…
$ Division <chr> "0", "1", "1", "1", "1", "1", "1", "1", "2", "2", "2",…
$ `State\n(FIPS)` <chr> "00", "00", "09", "23", "25", "33", "44", "50", "00", …
$ Name <chr> "Northeast Region", "New England Division", "Connectic…
STATE_FIPS <- STATE_FIPS |>
rename(STATEFP = `State\n(FIPS)`,
STATE = Name) |>
select(STATEFP, STATE) |>
filter(STATEFP != "00")
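Note the backticks around State\n(FIPS) above: column names containing spaces, newlines, or other special characters have to be quoted this way. A minimal illustration (hypothetical one-column tibble):

```r
library(dplyr)
df <- tibble::tibble(`State\n(FIPS)` = c("00", "09"))
df |> rename(STATEFP = `State\n(FIPS)`)  # backticks let us refer to the awkward name
```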
STATE_FIPS
# A tibble: 51 × 2
STATEFP STATE
<chr> <chr>
1 09 Connecticut
2 23 Maine
3 25 Massachusetts
4 33 New Hampshire
5 44 Rhode Island
6 50 Vermont
7 34 New Jersey
8 36 New York
9 42 Pennsylvania
10 17 Illinois
# … with 41 more rows
There’s an issue currently with the ps_data file from OCS, so we’ll use this file instead:
ue_rate_data <- list.files(recursive = TRUE,
path = "data/raw/Unemployment",
pattern = "*.xlsx",
full.names = TRUE) |>
map(~read_xlsx(., skip = 10))
ue_rate_names <- list.files(recursive = TRUE,
path = "data/raw/Unemployment",
pattern = "*.xlsx",
full.names = TRUE) |>
map(~read_xlsx(., range = "B4:B6")) |>
map(c(1,2)) |>
unlist()
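The map(c(1, 2)) step above uses purrr’s position-based extraction: for each list element it plucks [[1]][[2]], here the cell holding the state name. A toy illustration (made-up nested list):

```r
library(purrr)
nested <- list(list(a = list("x", "y")),
               list(b = list("z", "w")))
map(nested, c(1, 2)) |> unlist()  # plucks element [[1]][[2]] from each: "y" "w"
```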
names(ue_rate_data) <- ue_rate_names
ue_rate_data$Alabama
# A tibble: 44 × 14
Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1977 7.5 9 7.7 7.2 6.8 8.6 8 7.8 6.7 6.3 6.3 6
2 1978 7.1 6.9 6.2 5.4 5.1 6.9 6.7 6.7 6.5 6.3 6.3 6.5
3 1979 6.7 7.5 6.9 6.6 6.4 8.4 7.7 7.8 7.1 7.2 6.9 6.7
4 1980 7.7 7.8 7.4 7.4 8.4 9.7 10.4 10.3 9.3 9.6 9.4 9
5 1981 10 10.3 9.5 9.1 9.4 11.1 10.4 10.9 10.8 11.7 11.5 11.8
6 1982 13.2 13.2 12.9 12.6 12.8 14.5 14.7 14.8 14.7 15.1 15.4 15.3
7 1983 16 16 14.5 13.7 13.3 14.6 13.9 13.8 13.2 12.8 12.1 11.8
8 1984 12.5 12.4 11.4 10.8 10.1 11.3 11.5 11.3 10.8 10.2 9.7 10.1
9 1985 10.7 10.5 9.8 8.7 8.4 9.6 9.2 8.8 8.6 8.6 8.4 8.7
10 1986 9.3 10.4 10.1 9.4 9.4 10.5 9.7 9.6 9.7 9.7 9.6 9
# … with 34 more rows, and 1 more variable: Annual <dbl>
ue_rate_data <- ue_rate_data |>
map_df(bind_rows, .id = "STATE") |>
select(STATE, Year, Annual) |>
rename("YEAR" = Year,
"VALUE" = Annual) |>
mutate(VARIABLE = "Unemployment_rate")
ue_rate_data
# A tibble: 2,244 × 4
STATE YEAR VALUE VARIABLE
<chr> <dbl> <dbl> <chr>
1 Alabama 1977 7.3 Unemployment_rate
2 Alabama 1978 6.4 Unemployment_rate
3 Alabama 1979 7.2 Unemployment_rate
4 Alabama 1980 8.9 Unemployment_rate
5 Alabama 1981 10.6 Unemployment_rate
6 Alabama 1982 14.1 Unemployment_rate
7 Alabama 1983 13.8 Unemployment_rate
8 Alabama 1984 11 Unemployment_rate
9 Alabama 1985 9.2 Unemployment_rate
10 Alabama 1986 9.7 Unemployment_rate
# … with 2,234 more rows
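The .id = "STATE" argument above is what turns the list names into a STATE column when the per-state tibbles get stacked. A minimal illustration (made-up values):

```r
library(purrr)
library(dplyr)
lst <- list(Alabama = tibble::tibble(Year = 1977, Annual = 7.3),
            Alaska  = tibble::tibble(Year = 1977, Annual = 9.5))
map_df(lst, bind_rows, .id = "STATE")  # list names become the STATE column
```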
# A tibble: 6 × 6
`NOTE: Number in thousands.` ...2 ...3 ...4 ...5 ...6
<chr> <chr> <chr> <chr> <chr> <chr>
1 2018 <NA> <NA> <NA> <NA> <NA>
2 STATE Total Number "Standard\nerror" Percent "Sta…
3 Alabama 4877 779 "65" 16 "1.3"
4 Alaska 720 94 "9" 13.1 "1.2"
5 Arizona 7241 929 "80" 12.80000000… "1.1…
6 Arkansas 2912 462 "38" 15.9 "1.3"
Source: US Census Bureau Data
Due to the spaces and / characters in the column names, read_lines() from the readr package works better than read_csv() here.
[1] "Estimated crime in Alabama"
[2] ",,National or state crime,,,,,,,"
[3] ",,Violent crime,,,,,,,"
[4] "Year,Population,Violent crime total,Murder and nonnegligent Manslaughter,Legacy rape /1,Revised rape /2,Robbery,Aggravated assault,"
[5] "1977, 3690000, 15293, 524, 929,, 3572, 10268 "
[6] "1978, 3742000, 15682, 499, 954,, 3708, 10521 "
[1] " NBER WORKING PAPER SERIES\n\n\n\n RIGHT-TO-CARRY LAWS AND VIOLENT CRIME:\n A COMPREHENSIVE ASSESSMENT USING PANEL DATA AND\n A STATE-LEVEL SYNTHETIC CONTROL ANALYSIS\n\n John J. Donohue\n Abhay Aneja\n Kyle D. Weber\n\n Working Paper 23510\n http://www.nber.org/papers/w23510\n\n\n NATIONAL BUREAU OF ECONOMIC RESEARCH\n 1050 Massachusetts Avenue\n Cambridge, MA 02138\n June 2017, Revised November 2018\n\n\n\nPreviously circulated as \"Right-to-Carry Laws and Violent Crime: A Comprehensive Assessment\nUsing Panel Data and a State-Level Synthetic Controls Analysis.\" We thank Dan Ho, Stefano\nDellaVigna, Rob Tibshirani, Trevor Hastie, StefanWager, Jeff Strnad, and participants at the\n2011 Conference of Empirical Legal Studies (CELS), 2012 American Law and Economics\nAssociation (ALEA) Annual Meeting, 2013 Canadian Law and Economics Association (CLEA)\nAnnual Meeting, 2015 NBER Summer Institute (Crime), and the Stanford Law School faculty\nworkshop for their comments and helpful suggestions. Financial support was provided by\nStanford Law School. We are indebted to Alberto Abadie, Alexis Diamond, and Jens\nHainmueller for their work developing the synthetic control algorithm and programming the Stata\nmodule used in this paper and for their helpful comments. The authors would also like to thank\nAlex Albright, Andrew Baker, Jacob Dorn, Bhargav Gopal, Crystal Huang, Mira Korb, Haksoo\nLee, Isaac Rabbani, Akshay Rao, Vikram Rao, Henrik Sachs and Sidharth Sah who provided\nexcellent research assistance, as well as Addis O’Connor and Alex Chekholko at the Research\nComputing division of Stanford’s Information Technology Services for their technical support.\nThe views expressed herein are those of the author and do not necessarily reflect the views of the\nNational Bureau of Economic Research.\n\nNBER working papers are circulated for discussion and comment purposes. 
They have not been\npeer-reviewed or been subject to the review by the NBER Board of Directors that accompanies\nofficial NBER publications.\n\n© 2017 by John J. Donohue, Abhay Aneja, and Kyle D. Weber. All rights reserved. Short\nsections of text, not to exceed two paragraphs, may be quoted without explicit permission\nprovided that full credit, including © notice, is given to the source.\n"
dem_77_79 <- dem_77_79 |>
rename("race_sex" =`Race/Sex Indicator`) |>
mutate(SEX = str_extract(race_sex, "male|female"),
RACE = str_extract(race_sex, "Black|White|Other"))|>
select(-`FIPS State Code`, -`race_sex`) |>
rename("YEAR" = `Year of Estimate`,
"STATE" = `State Name`) |>
filter(YEAR %in% 1977:1979)
dem_77_79 <- dem_77_79 |>
pivot_longer(cols=contains("years"),
names_to = "AGE_GROUP",
values_to = "SUB_POP")
glimpse(dem_77_79)
Rows: 16,524
Columns: 6
$ YEAR <dbl> 1977, 1977, 1977, 1977, 1977, 1977, 1977, 1977, 1977, 1977, …
$ STATE <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "Alab…
$ SEX <chr> "male", "male", "male", "male", "male", "male", "male", "mal…
$ RACE <chr> "White", "White", "White", "White", "White", "White", "White…
$ AGE_GROUP <chr> "Under 5 years", "5 to 9 years", "10 to 14 years", "15 to 19…
$ SUB_POP <dbl> 98814, 113365, 123107, 135343, 126053, 111547, 100674, 81038…
dem_80_89 <- dem_80_89 |>
map_df(bind_rows)
dem_80_89 <- dem_80_89 |>
rename("race_sex" =`Race/Sex Indicator`) |>
mutate(SEX = str_extract(race_sex, "male|female"),
RACE = str_extract(race_sex, "Black|White|Other"))|>
select( -`race_sex`) |>
rename("YEAR" = `Year of Estimate`) |>
rename("STATEFP_temp" = "FIPS State and County Codes") |>
mutate(STATEFP = str_sub(STATEFP_temp, start = 1, end = 2)) |>
left_join(STATE_FIPS, by = "STATEFP") |>
select(-STATEFP)
dem_80_89 <- dem_80_89 |>
pivot_longer(cols=contains("years"),
names_to = "AGE_GROUP",
values_to = "SUB_POP_temp") |>
group_by(YEAR, STATE, AGE_GROUP, SEX, RACE) |>
summarize(SUB_POP = sum(SUB_POP_temp), .groups="drop")
dem_80_89
# A tibble: 55,080 × 6
YEAR STATE AGE_GROUP SEX RACE SUB_POP
<dbl> <chr> <chr> <chr> <chr> <dbl>
1 1980 Alabama 10 to 14 years female Black 50108
2 1980 Alabama 10 to 14 years female Other 805
3 1980 Alabama 10 to 14 years female White 109066
4 1980 Alabama 10 to 14 years male Black 50768
5 1980 Alabama 10 to 14 years male Other 826
6 1980 Alabama 10 to 14 years male White 115988
7 1980 Alabama 15 to 19 years female Black 58428
8 1980 Alabama 15 to 19 years female Other 743
9 1980 Alabama 15 to 19 years female White 126783
10 1980 Alabama 15 to 19 years male Black 56808
# … with 55,070 more rows
dem_90_99 <- dem_90_99 |>
map_df(bind_rows)
colnames(dem_90_99) <- c("YEAR", "STATEFP", "Age", "NH_W_M", "NH_W_F", "NH_B_M",
"NH_B_F", "NH_AIAN_M", "NH_AIAN_F", "NH_API_M", "NH_API_F",
"H_W_M", "H_W_F", "H_B_M", "H_B_F", "H_AIAN_M", "H_AIAN_F",
"H_API_M", "H_API_F")
dem_90_99 <- dem_90_99 |>
drop_na() |>
mutate(W_M = NH_W_M + H_W_M, W_F = NH_W_F + H_W_F,
B_M = NH_B_M + H_B_M, B_F = NH_B_F + H_B_F,
AIAN_M = NH_AIAN_M + H_AIAN_M, AIAN_F = NH_AIAN_F + H_AIAN_F,
API_M = NH_API_M + H_API_M, API_F = NH_API_F + H_API_F) |>
select(-starts_with("NH_"), -starts_with("H_"))
dem_90_99 <- dem_90_99 |>
mutate(AGE_GROUP = cut(Age,
breaks = seq(0,90, by=5),
right = FALSE, labels = pull(distinct(dem_77_79,AGE_GROUP), AGE_GROUP))) |>
select(-Age) |>
pivot_longer(cols = c(starts_with("W_"),
starts_with("B_"),
starts_with("AIAN_"),
starts_with("API_")),
names_to = "RACE",
values_to = "SUB_POP_temp") |>
mutate(SEX = case_when(str_detect(RACE, "_M") ~ "Male",
TRUE ~ "Female"),
RACE = case_when(str_detect(RACE, "W_") ~ "White",
str_detect(RACE, "B_") ~ "Black",
TRUE ~ "Other"))
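The cut() call above bins single years of age into the same 5-year groups used by the 1977–79 data. A small sketch (hypothetical ages; assumes ages in the file run 0–89):

```r
ages <- c(0, 4, 5, 17, 42, 89)
cut(ages, breaks = seq(0, 90, by = 5), right = FALSE)
# right = FALSE makes intervals like [15,20), so a 17-year-old lands in 15-19
```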
dem_90_99 <- dem_90_99 |>
left_join(STATE_FIPS, by = "STATEFP") |>
select(-STATEFP) |>
group_by(YEAR, STATE, AGE_GROUP, SEX, RACE) |>
summarize(SUB_POP = sum(SUB_POP_temp), .groups="drop")
glimpse(dem_90_99)
Rows: 55,080
Columns: 6
$ YEAR <dbl> 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, …
$ STATE <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "Alab…
$ AGE_GROUP <fct> Under 5 years, Under 5 years, Under 5 years, Under 5 years, …
$ SEX <chr> "Female", "Female", "Female", "Male", "Male", "Male", "Femal…
$ RACE <chr> "Black", "Other", "White", "Black", "Other", "White", "Black…
$ SUB_POP <dbl> 45377, 1406, 92380, 46635, 1360, 98524, 46067, 1698, 92530, …
dem_00_10 <- dem_00_10 |>
map_df(bind_rows)
dem_00_10 <- dem_00_10 |>
select(-ESTIMATESBASE2000,-CENSUS2010POP) |>
filter(NAME != "United States",
SEX != 0,
RACE != 0,
AGEGRP != 0,
ORIGIN == 0) |>
select(-REGION, -DIVISION, -ORIGIN, -STATE) |>
rename("STATE" = NAME,
"AGE_GROUP" = AGEGRP)
dem_00_10 <- dem_00_10 |>
mutate(SEX = factor(SEX, levels = 1:2, labels = c("Male", "Female")),
RACE = factor(RACE, levels = 1:6, labels = c("White", "Black", rep("Other",4))),
AGE_GROUP = factor(AGE_GROUP, levels = 1:18,
labels = pull(distinct(dem_77_79,AGE_GROUP), AGE_GROUP)))
dem_00_10 <- dem_00_10 |>
pivot_longer(cols=contains("ESTIMATE"), names_to = "YEAR", values_to = "SUB_POP_temp") |>
mutate(YEAR = str_sub(YEAR, start=-4),
YEAR = as.numeric(YEAR)) |>
group_by(YEAR, AGE_GROUP, STATE, SEX, RACE) |>
summarize(SUB_POP = sum(SUB_POP_temp), .groups = "drop")
glimpse(dem_00_10)
Rows: 60,588
Columns: 6
$ YEAR <dbl> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, …
$ AGE_GROUP <fct> Under 5 years, Under 5 years, Under 5 years, Under 5 years, …
$ STATE <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "Alab…
$ SEX <fct> Male, Male, Male, Female, Female, Female, Male, Male, Male, …
$ RACE <fct> White, Black, Other, White, Black, Other, White, Black, Othe…
$ SUB_POP <dbl> 99527, 46595, 4487, 94473, 45672, 4431, 14765, 1039, 8572, 1…
pop_77_79 <- dem_77_79 |>
group_by(YEAR, STATE) |>
summarize(TOT_POP = sum(SUB_POP), .groups = "drop")
pop_77_79
# A tibble: 153 × 3
YEAR STATE TOT_POP
<dbl> <chr> <dbl>
1 1977 Alabama 3782571
2 1977 Alaska 397220
3 1977 Arizona 2427296
4 1977 Arkansas 2207195
5 1977 California 22350332
6 1977 Colorado 2696179
7 1977 Connecticut 3088745
8 1977 Delaware 594815
9 1977 District of Columbia 681766
10 1977 Florida 8888806
# … with 143 more rows
dem_77_79 <- dem_77_79 |>
left_join(pop_77_79, by = c("YEAR", "STATE")) |>
mutate(PERC_SUB_POP = (SUB_POP/TOT_POP)*100) |>
select(-SUB_POP, -TOT_POP) |>
mutate(SEX = str_to_title(SEX))
dem_77_79
# A tibble: 16,524 × 6
YEAR STATE SEX RACE AGE_GROUP PERC_SUB_POP
<dbl> <chr> <chr> <chr> <chr> <dbl>
1 1977 Alabama Male White Under 5 years 2.61
2 1977 Alabama Male White 5 to 9 years 3.00
3 1977 Alabama Male White 10 to 14 years 3.25
4 1977 Alabama Male White 15 to 19 years 3.58
5 1977 Alabama Male White 20 to 24 years 3.33
6 1977 Alabama Male White 25 to 29 years 2.95
7 1977 Alabama Male White 30 to 34 years 2.66
8 1977 Alabama Male White 35 to 39 years 2.14
9 1977 Alabama Male White 40 to 44 years 1.98
10 1977 Alabama Male White 45 to 49 years 2.02
# … with 16,514 more rows
dem_90_99 <- dem_90_99 |>
left_join(pop_90_99, by = c("YEAR", "STATE")) |>
mutate(PERC_SUB_POP = (SUB_POP/TOT_POP)*100) |>
select(-SUB_POP, -TOT_POP)
dem_90_99
# A tibble: 55,080 × 6
YEAR STATE AGE_GROUP SEX RACE PERC_SUB_POP
<dbl> <chr> <fct> <chr> <chr> <dbl>
1 1990 Alabama Under 5 years Female Black 1.12
2 1990 Alabama Under 5 years Female Other 0.0347
3 1990 Alabama Under 5 years Female White 2.28
4 1990 Alabama Under 5 years Male Black 1.15
5 1990 Alabama Under 5 years Male Other 0.0336
6 1990 Alabama Under 5 years Male White 2.43
7 1990 Alabama 5 to 9 years Female Black 1.14
8 1990 Alabama 5 to 9 years Female Other 0.0419
9 1990 Alabama 5 to 9 years Female White 2.29
10 1990 Alabama 5 to 9 years Male Black 1.16
# … with 55,070 more rows
dem_00_10 <- dem_00_10 |>
left_join(pop_00_10, by = c("YEAR", "STATE")) |>
mutate(PERC_SUB_POP = (SUB_POP/TOT_POP)*100) |>
select(-SUB_POP, -TOT_POP)
dem_00_10
# A tibble: 60,588 × 6
YEAR AGE_GROUP STATE SEX RACE PERC_SUB_POP
<dbl> <fct> <chr> <fct> <fct> <dbl>
1 2000 Under 5 years Alabama Male White 2.24
2 2000 Under 5 years Alabama Male Black 1.05
3 2000 Under 5 years Alabama Male Other 0.101
4 2000 Under 5 years Alabama Female White 2.12
5 2000 Under 5 years Alabama Female Black 1.03
6 2000 Under 5 years Alabama Female Other 0.0995
7 2000 Under 5 years Alaska Male White 2.35
8 2000 Under 5 years Alaska Male Black 0.165
9 2000 Under 5 years Alaska Male Other 1.37
10 2000 Under 5 years Alaska Female White 2.26
# … with 60,578 more rows
❗ This would be a good part of the code to write a user-defined function…
The general syntax for a function in R is:
Note: by default the last object created within the function is returned from the function
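A minimal template (hypothetical names — my_function, arg1, and arg2 are placeholders):

```r
my_function <- function(arg1, arg2){
  result <- arg1 + arg2  # do something with the arguments
  result                 # the last object evaluated is returned by default
}
my_function(1, 2)
```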
combine_demo_pop <- function(df_dem){
  # first compute each state's total population per year from the subgroup counts
  df_pop <- df_dem |>
    group_by(YEAR, STATE) |>
    summarize(TOT_POP = sum(SUB_POP), .groups = "drop")
  # then join the totals back in and convert counts to percentages
  df_dem |>
    left_join(df_pop, by = c("YEAR", "STATE")) |>
    mutate(PERC_SUB_POP = (SUB_POP/TOT_POP)*100) |>
    select(-SUB_POP, -TOT_POP) |>
    mutate(SEX = str_to_title(SEX))
}
Rows: 187,272
Columns: 6
$ YEAR <dbl> 1977, 1977, 1977, 1977, 1977, 1977, 1977, 1977, 1977, 197…
$ STATE <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "A…
$ SEX <chr> "Male", "Male", "Male", "Male", "Male", "Male", "Male", "…
$ RACE <chr> "White", "White", "White", "White", "White", "White", "Wh…
$ AGE_GROUP <chr> "Under 5 years", "5 to 9 years", "10 to 14 years", "15 to…
$ PERC_SUB_POP <dbl> 2.6123502, 2.9970356, 3.2545853, 3.5780690, 3.3324688, 2.…
DONOHUE_AGE_GROUPS <- c("15 to 19 years",
"20 to 24 years",
"25 to 29 years",
"30 to 34 years",
"35 to 39 years")
dem_DONOHUE <- dem |>
filter(AGE_GROUP %in% DONOHUE_AGE_GROUPS,
SEX == "Male") |>
mutate(AGE_GROUP = fct_collapse(AGE_GROUP, "20 to 39 years"=c("20 to 24 years",
"25 to 29 years",
"30 to 34 years",
"35 to 39 years")),
AGE_GROUP = str_replace_all(string = AGE_GROUP,
pattern = " ",
replacement = "_")) |>
group_by(YEAR, STATE, RACE, SEX, AGE_GROUP) |>
summarize(PERC_SUB_POP = sum(PERC_SUB_POP), .groups = "drop") |>
unite(col = "VARIABLE", RACE, SEX, AGE_GROUP, sep = "_") |>
rename("VALUE" = PERC_SUB_POP)
dem_DONOHUE
# A tibble: 10,404 × 4
YEAR STATE VARIABLE VALUE
<dbl> <chr> <chr> <dbl>
1 1977 Alabama Black_Male_15_to_19_years 1.55
2 1977 Alabama Black_Male_20_to_39_years 3.04
3 1977 Alabama Other_Male_15_to_19_years 0.0178
4 1977 Alabama Other_Male_20_to_39_years 0.0642
5 1977 Alabama White_Male_15_to_19_years 3.58
6 1977 Alabama White_Male_20_to_39_years 11.1
7 1977 Alaska Black_Male_15_to_19_years 0.163
8 1977 Alaska Black_Male_20_to_39_years 0.968
9 1977 Alaska Other_Male_15_to_19_years 1.12
10 1977 Alaska Other_Male_20_to_39_years 2.73
# … with 10,394 more rows
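The fct_collapse() step merges several factor levels into one; a minimal illustration (toy factor):

```r
library(forcats)
ages <- factor(c("15 to 19 years", "20 to 24 years", "25 to 29 years"))
fct_collapse(ages, "20 to 39 years" = c("20 to 24 years", "25 to 29 years"))
# the two collapsed levels are replaced by the single "20 to 39 years" level
```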
LOTT_AGE_GROUPS_NULL <- c("Under 5 years",
"5 to 9 years")
dem_LOTT <- dem |>
filter(!(AGE_GROUP %in% LOTT_AGE_GROUPS_NULL) )|>
mutate(AGE_GROUP = fct_collapse(AGE_GROUP,
"10 to 19 years"=c("10 to 14 years", "15 to 19 years"),
"20 to 29 years"=c("20 to 24 years", "25 to 29 years"),
"30 to 39 years"=c("30 to 34 years", "35 to 39 years"),
"40 to 49 years"=c("40 to 44 years", "45 to 49 years"),
"50 to 64 years"=c("50 to 54 years", "55 to 59 years",
"60 to 64 years"),
"65 years and over"=c("65 to 69 years", "70 to 74 years",
"75 to 79 years", "80 to 84 years",
"85 years and over")),
AGE_GROUP = str_replace_all(AGE_GROUP, " ", "_")) |>
group_by(YEAR, STATE, RACE, SEX, AGE_GROUP) |>
summarize(PERC_SUB_POP = sum(PERC_SUB_POP), .groups = "drop") |>
unite(col = "VARIABLE", RACE, SEX, AGE_GROUP, sep = "_") |>
rename("VALUE" = PERC_SUB_POP)
glimpse(dem_LOTT)
Rows: 62,424
Columns: 4
$ YEAR <dbl> 1977, 1977, 1977, 1977, 1977, 1977, 1977, 1977, 1977, 1977, 1…
$ STATE <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "Alaba…
$ VARIABLE <chr> "Black_Female_10_to_19_years", "Black_Female_20_to_29_years",…
$ VALUE <dbl> 3.01067713, 2.32860137, 1.29295656, 1.18231753, 1.73263106, 1…
# A tibble: 2,242 × 3
data_year state_abbr officer_state_total
<dbl> <chr> <dbl>
1 1977 AK 544
2 1977 AL 7380
3 1977 AR 3344
4 1977 AS 0
5 1977 AZ 6414
6 1977 CA 65596
7 1977 CO 7337
8 1977 CT 6051
9 1977 CZ 0
10 1977 DC 4751
# … with 2,232 more rows
State abbreviations:
state_abb_data <- tibble("state_abbr" = state.abb, "STATE" = state.name)
state_abb_data <- state_abb_data |>
mutate(state_abbr = str_replace(string = state_abbr,
pattern = "NE",
replacement = "NB")) |>
add_row(state_abbr = "DC", STATE = "District of Columbia")
ps_data <- ps_data |>
left_join(state_abb_data, by = "state_abbr") |>
select(-state_abbr) |>
rename(YEAR = "data_year",
VALUE = "officer_state_total") |>
mutate(VARIABLE = "officer_state_total")
ps_data
# A tibble: 1,938 × 4
YEAR VALUE STATE VARIABLE
<dbl> <dbl> <chr> <chr>
1 1977 544 Alaska officer_state_total
2 1977 7380 Alabama officer_state_total
3 1977 3344 Arkansas officer_state_total
4 1977 6414 Arizona officer_state_total
5 1977 65596 California officer_state_total
6 1977 7337 Colorado officer_state_total
7 1977 6051 Connecticut officer_state_total
8 1977 4751 District of Columbia officer_state_total
9 1977 1018 Delaware officer_state_total
10 1977 24588 Florida officer_state_total
# … with 1,928 more rows
denominator_temp <- population_data |>
select(-VARIABLE) |>
rename("Population_temp"=VALUE)
ps_data <- ps_data |>
left_join(denominator_temp, by=c("STATE","YEAR")) |>
mutate(VALUE = (VALUE * 100000) / Population_temp) |>
mutate(VARIABLE = "police_per_100k_lag") |>
select(-Population_temp)
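As a quick sanity check of the rate formula, plugging in Alaska’s 1977 values from the outputs above (544 officers; total population 397220):

```r
# Alaska, 1977: 544 officers per 397220 residents
(544 * 100000) / 397220  # about 137 officers per 100,000 residents
```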
head(ps_data)
# A tibble: 6 × 4
YEAR VALUE STATE VARIABLE
<dbl> <dbl> <chr> <chr>
1 1977 137. Alaska police_per_100k_lag
2 1977 195. Alabama police_per_100k_lag
3 1977 152. Arkansas police_per_100k_lag
4 1977 264. Arizona police_per_100k_lag
5 1977 293. California police_per_100k_lag
6 1977 272. Colorado police_per_100k_lag
# A tibble: 5 × 6
`NOTE: Number in thousands.` ...2 ...3 ...4 ...5 ...6
<chr> <chr> <chr> <chr> <chr> <chr>
1 2018 <NA> <NA> <NA> <NA> <NA>
2 STATE Total Number "Standard\nerror" Percent "Sta…
3 Alabama 4877 779 "65" 16 "1.3"
4 Alaska 720 94 "9" 13.1 "1.2"
5 Arizona 7241 929 "80" 12.80000000… "1.1…
colnames(poverty_rate_data) <- c("STATE", "Total", "Number", "Number_se",
"Percent", "Percent_se")
poverty_rate_data <- poverty_rate_data |>
filter(STATE != "STATE") |>
mutate(length_state = map_dbl(STATE, str_length)) |> # determine how long string in "STATE" column is
filter(length_state < 100) |> # filter to only include possible state lengths
mutate(STATE = str_replace(STATE, pattern = "D.C.",
replacement = "District of Columbia" ))
year_values <- poverty_rate_data |>
filter(str_detect(STATE, "[:digit:]")) |>
distinct(STATE)
year_values <- rep(pull(year_values, STATE), each = 52) # repeat values from STATE column 52 times each
poverty_rate_data <- poverty_rate_data |>
mutate(year_value = year_values) |>
select(-length_state) |>
filter(str_detect(STATE, "[:alpha:]"))
poverty_rate_data <- poverty_rate_data |>
filter(year_value != "2017") |>
filter(year_value != "2013 (18)") |>
mutate(YEAR = str_sub(year_value, start = 1, end = 4)) |>
select(-c(Number, Number_se, Percent_se, Total, year_value)) |>
rename("VALUE" = Percent) |>
mutate(VARIABLE = "Poverty_rate",
YEAR = as.numeric(YEAR),
VALUE = as.numeric(VALUE))
poverty_rate_data
# A tibble: 1,989 × 4
STATE VALUE YEAR VARIABLE
<chr> <dbl> <dbl> <chr>
1 Alabama 16 2018 Poverty_rate
2 Alaska 13.1 2018 Poverty_rate
3 Arizona 12.8 2018 Poverty_rate
4 Arkansas 15.9 2018 Poverty_rate
5 California 11.9 2018 Poverty_rate
6 Colorado 9.1 2018 Poverty_rate
7 Connecticut 10.2 2018 Poverty_rate
8 Delaware 7.4 2018 Poverty_rate
9 District of Columbia 14.7 2018 Poverty_rate
10 Florida 13.7 2018 Poverty_rate
# … with 1,979 more rows
# remove the legacy-rape footnote line and everything after it
crime_data <- crime_data[-(str_which(crime_data,
                                     "The figures shown in this column for the offense of rape were estimated using the legacy UCR definition of rape"):length(crime_data))]
n_rows <- 2014-1977+1 # determine how many rows there are for each state
rep_cycle <- 4 + n_rows
rep_cycle_cut <- 2 + n_rows
colnames_crime <- (crime_data[4])
# specify which rows are to be deleted based on the file format
delete_rows <- c(seq(from = 2,
to = length(crime_data),
by = rep_cycle),
seq(from = 3,
to = length(crime_data),
by = rep_cycle),
seq(from = 4,
to = length(crime_data),
by = rep_cycle))
sort(delete_rows) # which rows are to be deleted
[1] 2 3 4 44 45 46 86 87 88 128 129 130 170 171 172
[16] 212 213 214 254 255 256 296 297 298 338 339 340 380 381 382
[31] 422 423 424 464 465 466 506 507 508 548 549 550 590 591 592
[46] 632 633 634 674 675 676 716 717 718 758 759 760 800 801 802
[61] 842 843 844 884 885 886 926 927 928 968 969 970 1010 1011 1012
[76] 1052 1053 1054 1094 1095 1096 1136 1137 1138 1178 1179 1180 1220 1221 1222
[91] 1262 1263 1264 1304 1305 1306 1346 1347 1348 1388 1389 1390 1430 1431 1432
[106] 1472 1473 1474 1514 1515 1516 1556 1557 1558 1598 1599 1600 1640 1641 1642
[121] 1682 1683 1684 1724 1725 1726 1766 1767 1768 1808 1809 1810 1850 1851 1852
[136] 1892 1893 1894 1934 1935 1936 1976 1977 1978 2018 2019 2020 2060 2061 2062
[151] 2102 2103 2104
[1] ",,National or state crime,,,,,,,"
[2] ",,Violent crime,,,,,,,"
[3] "Year,Population,Violent crime total,Murder and nonnegligent Manslaughter,Legacy rape /1,Revised rape /2,Robbery,Aggravated assault,"
crime_data <- crime_data[-delete_rows]
# extract state labels from data
state_labels <- crime_data[str_which(crime_data, "Estimated crime in ")]
state_labels <- str_remove(state_labels, pattern = "Estimated crime in ")
state_label_order <- rep(state_labels, each = n_rows) # repeat n_rows times
crime_data <- crime_data[-str_which(crime_data, "Estimated crime")]
crime_data_sep <- read_csv(I(crime_data), col_names = FALSE) |>
select(-X6) # remove random extra-comma column
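read_csv() can parse an in-memory character vector directly when it is wrapped in I() (readr’s literal-data marker; assumes readr >= 2.0). A toy illustration:

```r
library(readr)
lines <- c("a,b", "1,2", "3,4")
read_csv(I(lines))  # parses the vector as if it were a two-column file
```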
# get column names for later
colnames(crime_data_sep) <- c("Year",
"Population",
"Violent_crime_total",
"Murder_and_nonnegligent_Manslaughter",
"Legacy_rape",
"Revised_rape",
"Robbery",
"Aggravated_assault")
# add the state labels in as a new column
crime_data_sep <- bind_cols(STATE = state_label_order, crime_data_sep)
crime_data <- crime_data_sep |>
mutate(VARIABLE = "Viol_crime_count") |>
rename("VALUE" = Violent_crime_total) |>
rename("YEAR" = Year) |>
select(YEAR,STATE, VARIABLE, VALUE)
crime_data
# A tibble: 1,938 × 4
YEAR STATE VARIABLE VALUE
<dbl> <chr> <chr> <dbl>
1 1977 Alabama Viol_crime_count 15293
2 1978 Alabama Viol_crime_count 15682
3 1979 Alabama Viol_crime_count 15578
4 1980 Alabama Viol_crime_count 17320
5 1981 Alabama Viol_crime_count 18423
6 1982 Alabama Viol_crime_count 17653
7 1983 Alabama Viol_crime_count 16471
8 1984 Alabama Viol_crime_count 17204
9 1985 Alabama Viol_crime_count 18398
10 1986 Alabama Viol_crime_count 22616
# … with 1,928 more rows
chr " Table A1: RTC Adoption Dates\n\n State Effective Date of RTC Law Fraction of Year In Effect Year of Passage RTC Date (Synthetic Controls Analysis)\n Alabama 1975 1975\n Alaska 10/1/1994 0.252 1995\n Arizona 7/17/1994 0.460 1995\n Arkansas 7/27/1995 0.433 1996\n California N/A 0\n Colorado 5/17/2003 0.627 2003\n Connecticut 1970 "| __truncated__
p_62 <- DAWpaper_p_62 |>
str_split("\n") |>
unlist() |>
as_tibble() |>
slice(-(1:2)) |>
rename(RTC = value) |>
slice(-c(53:54)) |> # physical page 60 marking; empty line removal
mutate(RTC = str_replace_all(RTC, "\\s{40,}", "|N/A|"),
RTC = str_trim(RTC, side = "left"),
RTC = str_replace_all(RTC, "\\s{2,15}", "|"))
head(p_62)
# A tibble: 6 × 1
RTC
<chr>
1 State|Effective Date of RTC Law|Fraction of Year In Effect Year of Passage|RT…
2 Alabama||1975|N/A|1975
3 Alaska||10/1/1994||0.252|||1995
4 Arizona||7/17/1994||0.460|||1995
5 Arkansas||7/27/1995||0.433|||1996
6 California||N/A|N/A|0
p_62 <- pull(p_62, RTC) |>
str_split( "\\|{1,}") # split data on "|" symbol
# get the tibble!
p_62 <- as_tibble(do.call(rbind, p_62)) # rbind and not bind_cols here b/c we have no column names yet
colnames(p_62) <- c("STATE",
"E_Date_RTC",
"Frac_Yr_Eff_Yr_Pass",
"RTC_Date_SA")
p_62 <- p_62 |>
slice(-c(1, 53:nrow(p_62))) # remove unnecessary rows
RTC <- p_62 |>
select(STATE, RTC_Date_SA) |>
rename(RTC_LAW_YEAR = RTC_Date_SA) |>
mutate(RTC_LAW_YEAR = as.numeric(RTC_LAW_YEAR)) |>
mutate(RTC_LAW_YEAR = case_when(RTC_LAW_YEAR == 0 ~ Inf,
TRUE ~ RTC_LAW_YEAR))
# combine after all that wrangling!
DONOHUE_DF <- bind_rows(dem_DONOHUE,
ue_rate_data,
poverty_rate_data,
crime_data,
population_data,
ps_data)
DONOHUE_DF
# A tibble: 20,247 × 4
YEAR STATE VARIABLE VALUE
<dbl> <chr> <chr> <dbl>
1 1977 Alabama Black_Male_15_to_19_years 1.55
2 1977 Alabama Black_Male_20_to_39_years 3.04
3 1977 Alabama Other_Male_15_to_19_years 0.0178
4 1977 Alabama Other_Male_20_to_39_years 0.0642
5 1977 Alabama White_Male_15_to_19_years 3.58
6 1977 Alabama White_Male_20_to_39_years 11.1
7 1977 Alaska Black_Male_15_to_19_years 0.163
8 1977 Alaska Black_Male_20_to_39_years 0.968
9 1977 Alaska Other_Male_15_to_19_years 1.12
10 1977 Alaska Other_Male_20_to_39_years 2.73
# … with 20,237 more rows
# to wide format!
DONOHUE_DF <- DONOHUE_DF |>
pivot_wider(names_from = "VARIABLE",
values_from = "VALUE")
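pivot_wider() turns each distinct VARIABLE value into its own column, keyed by the remaining YEAR/STATE columns. A minimal illustration (made-up values):

```r
library(tidyr)
long <- tibble::tibble(YEAR = 1977, STATE = "Alabama",
                       VARIABLE = c("Poverty_rate", "Unemployment_rate"),
                       VALUE = c(16.0, 7.3))
pivot_wider(long, names_from = "VARIABLE", values_from = "VALUE")
# one row per YEAR-STATE, with Poverty_rate and Unemployment_rate as columns
```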
# add in RTC data!
DONOHUE_DF <- DONOHUE_DF |>
left_join(RTC , by = c("STATE")) |>
mutate(RTC_LAW = case_when(YEAR >= RTC_LAW_YEAR ~ TRUE,
TRUE ~ FALSE)) |>
drop_na() # drop rows with missing information
# filter to only data where RTC laws were adopted between 1980-2010
# have crime data pre- and post-adoption this way
baseline_year <- min(DONOHUE_DF$YEAR)
censoring_year <- max(DONOHUE_DF$YEAR)
DONOHUE_DF <- DONOHUE_DF |>
mutate(TIME_0 = baseline_year,
TIME_INF = censoring_year) |>
filter(RTC_LAW_YEAR > TIME_0)
# calculate violent crime rate; put population/crime on log scale
DONOHUE_DF <- DONOHUE_DF |>
mutate(Viol_crime_rate_1k = (Viol_crime_count*1000)/Population,
Viol_crime_rate_1k_log = log(Viol_crime_rate_1k),
Population_log = log(Population))
DONOHUE_DF |>
slice_sample(n = 10) |>
glimpse()
Rows: 10
Columns: 20
$ YEAR <dbl> 2000, 1989, 1996, 1995, 1988, 1999, 2006, 19…
$ STATE <chr> "North Dakota", "Kentucky", "Florida", "Miss…
$ Black_Male_15_to_19_years <dbl> 0.0437679, 0.3637464, 0.6871117, 1.8943893, …
$ Black_Male_20_to_39_years <dbl> 0.1799001, 1.1552267, 2.2271226, 5.0746101, …
$ Other_Male_15_to_19_years <dbl> 0.35201231, 0.02912473, 0.07697420, 0.046566…
$ Other_Male_20_to_39_years <dbl> 0.9635169, 0.1212851, 0.3401490, 0.1684265, …
$ White_Male_15_to_19_years <dbl> 3.940669, 3.655086, 2.342130, 2.293715, 3.11…
$ White_Male_20_to_39_years <dbl> 12.866673, 14.588717, 11.249165, 9.212543, 1…
$ Unemployment_rate <dbl> 3.0, 6.4, 5.2, 6.2, 6.4, 4.0, 4.4, 5.5, 7.0,…
$ Poverty_rate <dbl> 10.4, 16.1, 14.2, 23.5, 17.3, 11.9, 12.8, 15…
$ Viol_crime_count <dbl> 523, 13302, 151350, 13560, 14179, 38111, 119…
$ Population <dbl> 642023, 3677287, 14426911, 2690788, 3167148,…
$ police_per_100k_lag <dbl> 250.7698, 195.1167, 437.2384, 267.7654, 251.…
$ RTC_LAW_YEAR <dbl> 1986, 1997, 1988, 1990, 1996, 1997, 2007, 19…
$ RTC_LAW <lgl> TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE,…
$ TIME_0 <dbl> 1980, 1980, 1980, 1980, 1980, 1980, 1980, 19…
$ TIME_INF <dbl> 2010, 2010, 2010, 2010, 2010, 2010, 2010, 20…
$ Viol_crime_rate_1k <dbl> 0.8146126, 3.6173407, 10.4908112, 5.0394160,…
$ Viol_crime_rate_1k_log <dbl> -0.2050427, 1.2857391, 2.3504998, 1.6172902,…
$ Population_log <dbl> 13.37238, 15.11769, 16.48461, 14.80534, 14.9…
LOTT_DF <- bind_rows(dem_LOTT,
ue_rate_data,
poverty_rate_data,
crime_data,
population_data,
ps_data) |>
pivot_wider(names_from = "VARIABLE",
values_from = "VALUE") |>
left_join(RTC , by = c("STATE")) |>
mutate(RTC_LAW = case_when(YEAR >= RTC_LAW_YEAR ~ TRUE,
TRUE ~ FALSE)) |>
drop_na()
baseline_year <- min(LOTT_DF$YEAR)
censoring_year <- max(LOTT_DF$YEAR)
LOTT_DF <- LOTT_DF |>
mutate(TIME_0 = baseline_year,
TIME_INF = censoring_year) |>
filter(RTC_LAW_YEAR > TIME_0)
LOTT_DF <- LOTT_DF |>
mutate(Viol_crime_rate_1k = (Viol_crime_count*1000)/Population,
Viol_crime_rate_1k_log = log(Viol_crime_rate_1k),
Population_log = log(Population))
LOTT_DF
# A tibble: 1,364 × 50
YEAR STATE Black…¹ Black…² Black…³ Black…⁴ Black…⁵ Black…⁶ Black…⁷ Black…⁸
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1980 Alaska 0.264 0.443 0.201 0.116 0.0924 0.0264 0.297 0.695
2 1980 Arizona 0.287 0.278 0.165 0.119 0.136 0.103 0.311 0.338
3 1980 Arkans… 1.82 1.50 0.842 0.634 1.02 1.16 1.81 1.26
4 1980 Califo… 0.780 0.815 0.581 0.394 0.456 0.292 0.808 0.815
5 1980 Colora… 0.352 0.388 0.245 0.172 0.164 0.103 0.377 0.467
6 1980 Delawa… 1.87 1.68 1.14 0.783 0.952 0.670 1.81 1.36
7 1980 Distri… 6.53 7.54 5.18 3.89 6.10 4.15 6.32 6.40
8 1980 Florida 1.50 1.37 0.912 0.679 0.812 0.604 1.49 1.20
9 1980 Georgia 2.90 2.78 1.85 1.22 1.56 1.35 2.92 2.45
10 1980 Hawaii 0.0930 0.215 0.0776 0.0253 0.0197 0.00738 0.180 0.656
# … with 1,354 more rows, 40 more variables: Black_Male_30_to_39_years <dbl>,
# Black_Male_40_to_49_years <dbl>, Black_Male_50_to_64_years <dbl>,
# Black_Male_65_years_and_over <dbl>, Other_Female_10_to_19_years <dbl>,
# Other_Female_20_to_29_years <dbl>, Other_Female_30_to_39_years <dbl>,
# Other_Female_40_to_49_years <dbl>, Other_Female_50_to_64_years <dbl>,
# Other_Female_65_years_and_over <dbl>, Other_Male_10_to_19_years <dbl>,
# Other_Male_20_to_29_years <dbl>, Other_Male_30_to_39_years <dbl>, …
❓ Why do LOTT_DF and DONOHUE_DF have different dimensions?